I’m excited about this course since I feel I already learned a bunch of new things during the installation/setup phase, even though I use the key tools of the course, R (Studio) and GitHub, on a daily basis. Since I’m very much self-taught, it will be intriguing to learn new things about these tools and perhaps how to use them in a more sensible manner. Learning more about the interplay between R (Studio) and GitHub will be especially useful for me, and I’m also excited to dive into the other contents of the course.
To be honest, I haven’t gone over R for Health Data Science in detail yet, merely browsed it, but I think it will be a great source of information for the later parts of the course. I’m quite familiar with R basics and plotting (often with ggplot2), but I rarely use other tidyverse packages like dplyr anymore: as a self-taught R user, I usually find other ways to execute things that feel more intuitive to me than the dplyr commands. But maybe I’ll be converted to a fan after the course, who knows!
My course repository can be found on my IODS-project GitHub page.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Nov 28 13:56:20 2023"
date()
## [1] "Tue Nov 28 13:56:20 2023"
library(dplyr)
library(readr)
library(ggplot2)
library(GGally)
# reading in the learning 2014 dataset and saving it as a data frame in a variable called "data"
data <- as.data.frame(read_csv("/Users/sannimoi/Documents/courses/IODS/IODS-project/data/learning2014.csv", show_col_types = FALSE))
# checking the dimensions and structure of the data
dim(data)
## [1] 166 7
# and structure of the data
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : num 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num 25 12 24 10 22 21 21 31 24 26 ...
The original learning2014 dataset was collected by Kimmo Vehkalahti in 2014-2015. It includes answers from 183 students to a variety of questions related to their learning and studying habits, as well as to the teaching on a course named Introduction to Social Statistics. The original dataset contained 60 variables and 183 observations (students), while the current dataset was processed from the original one and includes 166 observations and seven variables, as the dimensions and structure above show. While the age and gender columns are easy to understand, the rest of the variables may require a bit of clarification:
attitude - global attitude towards statistics
deep - average value from 12 questions related to deep learning
stra - average value from 8 questions related to strategic learning
surf - average value from 12 questions related to surface learning
points - exam points
The deep, strategic and surface learning questions were all measured on the Likert scale (1-5, 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree).
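As a small sketch of how such average scale scores are typically formed, each student's answers to a set of Likert items can be averaged with rowMeans(). The item names below are made up for illustration only; they are not the actual question codes used in the original survey.

```r
# Hypothetical Likert items on a 1-5 scale; d1..d3 are illustrative names only,
# not the real question codes of the learning2014 survey
answers <- data.frame(d1 = c(4, 5, 3),
                      d2 = c(3, 4, 2),
                      d3 = c(5, 4, 4))
# each student's scale score is the mean of their item answers
deep_score <- rowMeans(answers)
```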
# printing a graphical overview of the data, coloring the distributions by gender
ggpairs(data, mapping = aes(col=gender), lower = list(combo = wrap("facethist", bins = 20)))
In the graphical summary above, we can appreciate the distributions of each variable in the data and the relationships between the different variables. The data points, distributions and correlations are coloured by gender (red for female and blue for male students). The plots including gender look a bit different from the others, since it is the only variable in the data that is not numeric.
Overall, the number of female students who participated in the course is almost double that of the male students, which needs to be kept in mind when interpreting the results. Based on the distributions, the female students seem to be a bit younger on average, their global attitude towards statistics scores lower, and they record slightly higher values for the strategic and surface learning questions than their male counterparts on average. In contrast, the deep learning and exam point distributions seem quite similar between the genders.
Regarding the correlations between the different variables, the results differ quite a bit depending on the gender. For instance, the values from the surface learning questions seemed to correlate negatively with both attitude and the deep learning questions, but these correlations seem to be very specific to the male participants. Moreover, age seemed to have quite minimal correlations with the other variables, although the age of the male participants seemed to be negatively correlated with their exam points. However, no concrete conclusions can be made due to the evidently low number of older male students. Most notably, attitude seems to have the highest positive correlation with exam points regardless of gender.
To explore further which variables seem to associate with the exam points variable, a regression model is fitted to the data with three explanatory variables. I chose attitude, stra and surf as the three explanatory variables since they seem to have the highest absolute correlation values with the exam points based on the summary plot above.
# creating a regression model for exam points with multiple explanatory variables: attitude, stra and surf
my_model2 <- lm(points ~ attitude + stra + surf, data = data)
# printing out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The model essentially describes how well the combination of the attitude, stra and surf variables predicts the outcome variable, exam points. The printed summary above starts with a description of the residuals, i.e., the differences between the actual observed values and the values estimated by the model: the minimum and maximum, the 1st and 3rd quartiles, and the median. Here the median is about 0.5, quite close to zero, and the residuals seem fairly symmetrical, which is good.
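That residuals block is simply a five-number summary of the model residuals, which can be reproduced directly with quantile(). A quick sketch using the built-in mtcars data as a stand-in, since the learning2014 file lives on a local path:

```r
# fit a simple linear model on the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)
# the Min/1Q/Median/3Q/Max block reported by summary(fit) is the
# five-number summary of the residuals, obtainable with quantile()
quantile(resid(fit))
```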
The coefficients table, in turn, describes the estimated coefficients of the model, i.e., the intercept and the slope terms of the linear model. Based on these estimates and the explanatory variables, we could predict the outcome variable, exam points. The p-values are visible in the last column, and they come from significance tests of the null hypothesis that the corresponding coefficient is 0. The first row of the table describes the intercept, which estimates the exam points when all explanatory variables are zero; for the intercept, we get a significant p-value of 0.00322. The table also contains the standard error and t-value columns, which are related, since the t value is the coefficient divided by its standard error. The standard error describes the level of uncertainty of the coefficient, so we would like it to be as small as possible relative to the coefficient; correspondingly, for a confident estimate we would like the t value to be as large as possible, since a large t value indicates that the standard error is small in comparison to the coefficient. Here, the standard error of the intercept is relatively high, about 3.7 compared to the coefficient of about 11, and therefore the t value remains smaller than 3.
Since we have multiple explanatory variables, there is also a row for each of them with the slope estimate, standard error, t value and p-value. We can see that the surf variable (surface learning) is the only one with a negative slope estimate, i.e., a seemingly negative relationship with exam points, and it also has the highest standard error and the smallest t value. Accordingly, its p-value is nearly 0.5, meaning that an apparent relationship this strong between surf and exam points would arise by chance almost 50% of the time under the null hypothesis. Therefore, I chose to fit the model again without the surf variable.
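The relationship between the columns of the coefficients table can be verified directly: the t value column is just Estimate divided by Std. Error. A quick illustrative check, again using mtcars as a stand-in dataset:

```r
# illustrative check with mtcars: the t value column of the coefficients
# table equals Estimate / Std. Error
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
all.equal(unname(t_manual), unname(coefs[, "t value"]))   # -> TRUE
```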
# creating a regression model for exam points with two explanatory variables: attitude and stra
my_model2 <- lm(points ~ attitude + stra, data = data)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Based on the output, we can see that the p-value for the intercept is now 0.00025, so more significant than in the previous model, yet the p-value for the stra explanatory variable remains non-significant at 0.089, so I will fit the model once more without it.
# creating a regression model with one explanatory variable, attitude, for exam points
my_model2 <- lm(points ~ attitude, data = data)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
ggplot(data, aes(x = attitude, y = points)) + geom_point() + geom_smooth(method = "lm")
Now we can observe from the summary that the model has further improved: the p-value for the intercept is now much more significant at 1.95e-09, and for the slope it is almost as significant at 4.12e-09. Moreover, the standard errors are now smaller and the t values have also improved, being over 6 for both the intercept and the slope.
I also plotted a scatterplot above to further describe the relationship between the two variables, with the blue line capturing the estimated regression model. We can see that a higher score on the attitude scale, i.e., the explanatory variable, seems to positively correlate with the exam points, i.e., the outcome variable. This positive correlation can also be seen in the summary, since the slope estimate of the model is > 0.
Moreover, from the summary we can check the residual standard error, which describes the quality of the fit: it is essentially the typical size of the error term that every linear model is expected to contain. For the present model it is 5.32, resulting in a percentage error of about 46%. The degrees of freedom equal the number of data points minus the number of estimated coefficients, here 166 - 2 = 164. The F statistic, in turn, is a good indicator of whether there is a relationship between the predictor and response variables, and its p-value is derived from a test whose null hypothesis is that there is no relationship. The F statistic has steadily grown and its p-value shrunk as the surf and stra explanatory variables were removed, so the model has improved in this sense. Since we have a relatively large dataset, we can reject the null hypothesis of no relationship for all the models, and most confidently for the last model including only attitude as an explanatory variable.
The R-squared statistic describes how well the model fits the actual observed data: it is the proportion of the variance in the target variable, exam points, explained by the linear relationship with the predictor variable, attitude. In the present model, roughly 19% of the variance of the exam points variable can be explained by the attitude score.
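The "proportion of variance explained" reading of R-squared can be made concrete: it equals one minus the ratio of the residual sum of squares to the total sum of squares. A small verification sketch, again with mtcars as a stand-in:

```r
# illustrative check with mtcars: R-squared equals 1 - SS_res / SS_tot
fit <- lm(mpg ~ wt, data = mtcars)
ss_res <- sum(resid(fit)^2)                       # residual sum of squares
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2 <- 1 - ss_res / ss_tot
all.equal(r2, summary(fit)$r.squared)   # -> TRUE
```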
The linear regression model includes some assumptions about the data, so next we will plot a few diagnostics plots to interpret if those assumptions apply to the present dataset.
# generating selected diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
plot(my_model2, which=c(1,2,5))
The linear model assumes a linear relationship between the explanatory and outcome variables, and this assumption can be checked from the residuals vs fitted plot. The fitted line is roughly horizontal and shows no distinct patterns, so the assumption of a linear relationship indeed seems to hold here.
Secondly, the residual errors are assumed to be normally distributed, an assumption that can be evaluated from the normal Q-Q plot. The residual points should follow the straight dashed line, which seems to be the case here at least for the majority of values. Some discrepancy can be seen at the very low and high ends, which is quite normal, though.
Furthermore, the model assumes that the residuals have a constant variance and that no single observation has an unusually large influence on the fit, so we would like to check that there are no cases with extreme values that might strongly affect the regression results if excluded or included. This can be evaluated from the residuals vs leverage plot, in which the 3 most extreme points are highlighted. A few cases seem to exceed 3 standard deviations and represent potential outliers, so it might be useful to test the model fit without them.
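Refitting without the most influential points could be sketched as below, using mtcars as a stand-in and a common rule of thumb for flagging high leverage (this threshold is a convention, not something prescribed by the course material):

```r
# sketch: flag high-leverage observations and refit without them
fit <- lm(mpg ~ wt + hp, data = mtcars)
lev <- hatvalues(fit)
# a common rule of thumb flags leverage above 2 * p / n,
# where p is the number of coefficients and n the sample size
high_lev <- lev > 2 * length(coef(fit)) / nrow(mtcars)
fit_trimmed <- lm(mpg ~ wt + hp, data = mtcars[!high_lev, ])
```

Comparing the coefficients of fit and fit_trimmed then shows how sensitive the model is to those points.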
Based on these observations, it can be concluded that the model assumptions apply relatively well to the present dataset.
In summary, we can confidently say that there seems to be a relationship between the attitude and exam points variables, since we get a highly significant p-value for the model. However, the p-value should not be interpreted to mean that exam points could be estimated with high confidence from the attitude score alone, since less than 20% of the variability of the exam points variable is explained by attitude.
date()
## [1] "Tue Nov 28 13:56:27 2023"
library(boot)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
Reading in the joined student alcohol consumption dataset.
# reading in the dataset
alc <- read.csv("/Users/sannimoi/Documents/courses/IODS/IODS-project/data/alc.csv")
# checking the names of the variables, i.e., columns
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The current dataset was retrieved from http://www.archive.ics.uci.edu/dataset/320/student+performance and it contains information about the performance of Portuguese secondary education students in two subjects, math and Portuguese. For the present purpose, the two datasets have been combined, and that modified data is read in here. Several variables are easy to understand based on the column name, like school, sex, age, address etc. The Dalc and Walc variables are especially important here, since they describe alcohol consumption on workdays and weekends, respectively, while alc_use is the mean of the two, and the high_use column is set to TRUE if alc_use exceeds 2.
The purpose of this analysis is to study the relationships between high/low alcohol consumption and some other variables in the data. I have chosen the following variables to study:
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
absences - number of school absences (numeric: from 0 to 93)
I hypothesize that weekly study time and quality of family relationships would correlate negatively with alcohol consumption, i.e., that higher values for these factors would usually associate with lower alcohol consumption. Regarding the number of school absences and past class failures, I assume the opposite, i.e., a positive correlation between higher alcohol consumption and a higher number of absences/failures.
# drawing a bar plot of each chosen variable
gather(alc[,c("failures", "studytime", "famrel", "absences", "alc_use")]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
# and then summarizing the number of students that have been classified to the high alcohol usage group
alc %>% group_by(high_use) %>% summarise(count = n())
## # A tibble: 2 × 2
## high_use count
## <lgl> <int>
## 1 FALSE 259
## 2 TRUE 111
In general, based on the plots and table above, most students have a reasonable number of absences, with only a few standing out with more than 15, and the vast majority of students haven’t failed any classes in the past. The majority of students also fall into the low alcohol consumption category (high_use = FALSE, i.e., at most 2 portions based on the barplot), yet still about 30% have been classified into the high alcohol use group. Most students report good or excellent quality family relationships, while weekly study time seems to average around 2-5 hours, as category 2 stands out as the most common one; about 100 students report studying less than that, and fewer than 100 report studying more than 5 hours.
# creating cross-tabulations describing the numbers in each candidate explanatory variable category and the high/low alcohol usage category
table(high_use = alc$high_use, prediction = alc$failures)
## prediction
## high_use 0 1 2 3
## FALSE 238 12 8 1
## TRUE 87 12 9 3
table(high_use = alc$high_use, prediction = alc$studytime)
## prediction
## high_use 1 2 3 4
## FALSE 56 128 52 23
## TRUE 42 57 8 4
table(high_use = alc$high_use, prediction = alc$famrel)
## prediction
## high_use 1 2 3 4 5
## FALSE 6 9 39 128 77
## TRUE 2 9 25 52 23
table(high_use = alc$high_use, prediction = alc$absences)
## prediction
## high_use 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 19 20 21 26 27
## FALSE 50 37 40 31 24 16 16 9 14 5 5 2 3 1 1 0 0 1 0 2 1 0 0
## TRUE 13 13 16 7 11 6 5 3 6 6 2 3 4 1 6 1 1 1 1 0 1 1 1
## prediction
## high_use 29 44 45
## FALSE 0 0 1
## TRUE 1 1 0
# drawing boxplots to study the relationship between alcohol usage (high or not) and the chosen variables
ggplot(alc, aes(x = high_use, y = failures, col=high_use)) + geom_boxplot() + ggtitle("Past class failures") + xlab("High alcohol use")
ggplot(alc, aes(x = high_use, y = studytime, col=high_use)) + geom_boxplot() + ggtitle("Study time") + xlab("High alcohol use")
ggplot(alc, aes(x = high_use, y = famrel, col=high_use)) + geom_boxplot() + ggtitle("Quality of family relationships") + xlab("High alcohol use")
ggplot(alc, aes(x = high_use, y = absences, col=high_use)) + geom_boxplot() + ggtitle("Absences from school") + xlab("High alcohol use")
From the tables and plots above, we can get some idea of whether the chosen variables have any relationship with high alcohol consumption. As I assumed, the students reporting high alcohol consumption study less on average and report family relationships of lower quality on average. Also, as hypothesized, the number of absences from school seems higher on average in the high alcohol use group. The relationship with failed classes, however, is difficult to evaluate from the boxplot, since so few students had previously failed classes, yet some indication of a correlation can be seen in the table, as relatively more of the students with 2 or more failed classes also fall into the high alcohol use group.
To statistically explore the relationship between the chosen explanatory variables and the high/low alcohol consumption target variable, a logistic regression model will be fitted.
# fitting the model, using failures, studytime, famrel and absences as the explanatory variables
model <- glm(high_use ~ failures + studytime + famrel + absences, data = alc, family = "binomial")
# printing the summary
summary(model)
##
## Call:
## glm(formula = high_use ~ failures + studytime + famrel + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0531 -0.7954 -0.6253 1.0826 2.1737
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.60734 0.61872 0.982 0.326292
## failures 0.49010 0.20416 2.401 0.016370 *
## studytime -0.48901 0.15962 -3.064 0.002187 **
## famrel -0.24724 0.12916 -1.914 0.055598 .
## absences 0.07417 0.02246 3.303 0.000958 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 409.94 on 365 degrees of freedom
## AIC: 419.94
##
## Number of Fisher Scoring iterations: 4
# then printing the odds ratios of the coefficients
exp(coef(model))
## (Intercept) failures studytime famrel absences
## 1.8355483 1.6324760 0.6132323 0.7809515 1.0769908
# and then printing the confidence intervals for the coefficients
confint(model)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -0.60724157 1.826969798
## failures 0.09542278 0.902367559
## studytime -0.81145054 -0.184130907
## famrel -0.50161742 0.006600938
## absences 0.03228991 0.120768107
Looking at the summary of the logistic regression model, the residuals can be interpreted similarly to linear regression: they essentially describe the difference between the actual observed values and the values estimated by the model, including the minimum and maximum, the 1st and 3rd quartiles, and the median, which is about -0.6253 in this case. The coefficients table describes the estimated intercept and the slope terms for each explanatory variable, with p-values for each. The p-value for the intercept remains quite high at 0.33, while the p-values for all slope terms except famrel are significant. The null deviance describes the model fit when only the intercept is taken into account, while the residual deviance is the fit when all variables are included, so the larger the difference between them, the better the model; the difference does not seem very notable in this case. The AIC, in turn, is useful for comparing the fit of multiple regression models, as it can help to find the model that explains the most variation in the data, the lower the AIC the better.
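The drop from null to residual deviance can also be tested formally with a chi-squared test, since the deviance difference is approximately chi-squared distributed with degrees of freedom equal to the number of added parameters. A sketch with mtcars as a stand-in:

```r
# sketch with mtcars: testing the drop from null to residual deviance
fit <- glm(am ~ wt, data = mtcars, family = binomial)
dev_drop <- fit$null.deviance - fit$deviance   # improvement over intercept-only model
df_drop  <- fit$df.null - fit$df.residual      # number of added parameters
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)   # small p -> the predictors add value
```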
The odds ratios describe the multiplicative change in the odds of high alcohol consumption when the explanatory variable increases by one unit. An odds ratio below one indicates a negative association: here, increases in study time and in the quality of family relationships decrease the odds of high alcohol consumption. The odds ratios for failures and absences, in turn, are above one, so increases in them also increase the odds of high alcohol consumption; e.g., each additional past class failure multiplies the odds of high use by about 1.63, while each additional study time category multiplies them by about 0.61. Based on these odds ratios, all of the explanatory variables seem to associate with alcohol consumption as I assumed, studytime and famrel correlating negatively, while failures and absences had a positive correlation with alcohol consumption.
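To make the odds-based interpretation concrete, a minimal sketch with mtcars as a stand-in (modelling transmission type from car weight): exp() of the coefficients gives the odds ratios, and a predicted log-odds can be turned into a probability with plogis().

```r
# sketch with mtcars: modelling transmission type (am) from car weight (wt)
fit <- glm(am ~ wt, data = mtcars, family = binomial)
exp(coef(fit))   # odds ratios: multiplicative change in odds per unit of wt
# a predicted log-odds is converted to a probability with plogis(),
# i.e. exp(x) / (1 + exp(x))
plogis(predict(fit, newdata = data.frame(wt = 3)))
```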
The confidence intervals, in turn, describe the range of plausible values for each parameter: we can be 95% confident that the slope for each explanatory variable lies within the interval shown above. E.g., the slope for failures is likely between 0.095 and 0.902, which is quite a wide range, and this is also reflected in its p-value (0.016370), which is higher than e.g. that of absences, even though the odds ratio for absences is not as far from one as that for failures. For absences, in turn, the confidence interval is between 0.032 and 0.121 and the p-value is highly significant at 0.000958. This highlights that when exploring the relationship between an explanatory and an outcome variable, one should check both the effect size (the odds ratio) and the confidence measures of the relationship, like the p-value and confidence intervals.
Since the famrel variable did not have a significant relationship with alcohol consumption, I will remove it from the model and test the predictive power of the reduced model next.
# since quality of family relationships did not seem to have that significant relationship with alcohol consumption, fitting the model without it
model2 <- glm(high_use ~ failures + studytime + absences, data = alc, family = "binomial")
# predicting the probability of high_use based on the model
probabilities <- predict(model2, type = "response")
# adding the predicted probabilities to alc
alc <- mutate(alc, probability = probabilities)
# using the probabilities to make a prediction of high_use
# classifying probability > 0.5 as a cutoff for high use
alc <- mutate(alc, prediction = probability > 0.5)
# generating a 2x2 cross tabulation
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 246 13
## TRUE 86 25
# calculating for how many students the prediction is wrong
n_wrong <- sum(alc$high_use != alc$prediction)
n_wrong
## [1] 99
# since there are 370 students in total, the training error can be calculated n_wrong/370
n_wrong/370
## [1] 0.2675676
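In general, the training error is just the mean of a misclassification indicator, which avoids hard-coding the number of students. A minimal sketch with made-up vectors:

```r
# sketch: the training error is the mean of a misclassification indicator
actual    <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
mean(actual != predicted)   # 2 wrong out of 5 -> 0.4
```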
# then checking how well would random predictions work
# doing the following steps 100 times to derive an estimate of how well a random classification performs compared to our model
random_errors <- vector()
for (j in 1:100) {
# sampling the high use column to derive a new randomized column with the same ratio of TRUE/FALSE values
alc["high_use_random"] <- sample(alc$high_use)
# then calculating how many wrong predictions we get from this random classification
n_wrong <- sum(alc$high_use != alc$high_use_random)
# since there are 370 students in total, the training error can be calculated n_wrong/370
random_errors <- append(random_errors, n_wrong/370)
}
# then checking the mean, median and range of the error rate of these randomized classifications
mean(random_errors)
## [1] 0.4206486
median(random_errors)
## [1] 0.4216216
range(random_errors)
## [1] 0.3783784 0.4594595
Based on the model, we can say fairly confidently that the variables in the final model, weekly study time, number of past class failures and school absences, seem to have a relationship with the outcome variable, alcohol consumption. However, the predictive power of the model, i.e., predicting alcohol consumption from these variables, is by no means ideal, since the classification into the high/low consumption groups fails almost 27% of the time (= training error). This is still somewhat better than grouping the students at random (on average a 42% error rate over 100 randomizations), yet far from what we would call a confident predictive model.
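The training error is an optimistic estimate, so a cross-validated error would be a fairer measure of predictive power; the boot package loaded above provides cv.glm for this. A hedged sketch with mtcars as a stand-in dataset and a misclassification cost function (the 0.5 cutoff is my assumption, matching the prediction cutoff used above):

```r
library(boot)

# sketch with mtcars as a stand-in: 5-fold cross-validation of a logistic model
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
# cost: proportion of misclassifications at a 0.5 probability cutoff
cost <- function(obs, pred) mean(abs(obs - pred) > 0.5)
set.seed(1)  # for reproducible fold assignment
cv <- cv.glm(mtcars, fit, cost = cost, K = 5)
cv$delta[1]   # cross-validated misclassification rate
```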
date()
## [1] "Tue Nov 28 13:56:29 2023"
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(corrplot)
library(GGally)
# loading in the Boston data from the MASS package
data("Boston")
# checking the dimensions and structure
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The Boston dataset contains information about housing values in the suburbs of Boston. In total there are 506 observations (rows in the data frame) and 14 variables (columns in the data frame), most of them numerical. The variables include various pieces of information about the studied suburbs, like the per capita crime rate (crim), the average number of rooms per dwelling (rm), the median value of owner-occupied homes in $1000s (medv), etc. All the variables are explained here.
# for plotting purposes, changing the chas variable into TRUE/FALSE scale
# 1 -> TRUE (i.e., suburb tract bounds river)
# 0 -> FALSE (i.e., suburb tract does not bound river)
Boston2 <- gsub(1, "TRUE", Boston[,4])
Boston2 <- gsub(0, "FALSE", Boston2)
Boston["chas"] <- Boston2
# printing out the data summary
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Length:506
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 Class :character
## Median : 0.25651 Median : 0.00 Median : 9.69 Mode :character
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# printing a graphical overview of the data
ggpairs(Boston, lower = list(continuous = wrap("points", alpha = 0.3, size=0.1), combo = wrap("dot", alpha = 0.3, size=0.1)), upper=list(continuous=wrap("cor", size=2.1))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))
In the overview above, we can see that some variable distributions, like crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft.), and dis (weighted mean of distances to five Boston employment centres), are highly right-skewed: most values are small, with a long tail towards higher values. Some variables, like indus (proportion of non-retail business acres per town) and tax (full-value property-tax rate per $10,000), seem to have two distinct peaks, while others, like rm (average number of rooms per dwelling), seem more normally distributed. These observations seem reasonable. It is expected that only a small minority of residential lots would be very large in a big city like Boston (the skewness of zn), while the proportion of non-retail business acres is likely quite town-specific: towns with many industrial buildings would have few residential ones and vice versa (the two peaks of indus). Moreover, a variable like the average number of rooms per dwelling can be expected to be roughly normally distributed, since extremes (a very low or high number of rooms) are unreasonable for the majority of residents, and values accumulate around the mean. Regarding the only non-numerical variable, chas, describing whether the suburb tract bounds the river, the vast majority of suburbs do not seem to be river-bound, which is quite expected.
Regarding the relationships between the variables, most of the numerical variables correlate strongly with each other. The strongest association is the positive correlation between tax (full-value property-tax rate per $10,000) and rad (index of accessibility to radial highways), which seems intuitive: the properties closest to the highways have the highest tax rates. For the only non-numerical variable, chas, describing whether the suburb tract bounds the river, there are no direct correlation values, but some trends can be distinguished from the plots. For instance, the river-bound suburbs seem to have lower crime rates on average, while the age of their properties seems higher than in the non-river-bound counterparts. These observations seem intuitive, since the suburbs closest to the river most likely represent the (historical) downtown of the city with high-value homes.
Below, the correlation matrix is also visualized to allow for a closer review of the correlations between the variables.
# changing the chas variable back to the original 1/0 values
# TRUE -> 1 (i.e., suburb tract bounds river)
# FALSE -> 0 (i.e., suburb tract does not bound river)
Boston$chas <- as.numeric(Boston$chas == "TRUE")
# calculating the correlation matrix
cor_matrix <- cor(Boston)
# and visualizing it
corrplot(cor_matrix, method="circle")
In the correlation matrix visualizations we can distinguish the highest correlations between the variables in the dataset, the positive ones being visualized in blue and negative correlations in red. For instance, nox (nitrogen oxides concentration) correlates negatively with dis (weighted mean of distances to five Boston employment centres) and positively with indus (proportion of non-retail business acres per town), which is quite expected since pollution (like nitrogen oxides) is likely to be highest in the industrial areas, which also likely represent many of the employment centers.
Next, to standardize the dataset, we will scale the values, i.e., subtract each column mean from the corresponding column and divide the difference by the column's standard deviation.
# centering and standardizing variables
boston_scaled <- as.data.frame(scale(Boston))
# summarizing the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
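As a quick sanity check, the scale() call above should be equivalent to subtracting the column means and dividing by the column standard deviations "by hand". A minimal sketch using only base R and the numeric Boston data:

```r
library(MASS)  # for the Boston dataset

# manual centering and standardization, column by column
B <- as.matrix(Boston)
manual <- sweep(B, 2, colMeans(B), "-")           # subtract column means
manual <- sweep(manual, 2, apply(B, 2, sd), "/")  # divide by column sds

# should agree with scale() up to attributes
all.equal(scale(B), manual, check.attributes = FALSE)
## [1] TRUE
```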
By scaling, we have centered the values around 0, so the mean of each variable is now 0 and its standard deviation is 1. Next, we will create a categorical variable of the crime rate, generating 4 classes based on quantile values: low, intermediate low, intermediate high and high.
# determining the quantiles of the crim (crime rate per capita) variables
quants <- quantile(boston_scaled$crim)
# creating a categorical variable called 'crime'
crime <- cut(boston_scaled$crim, breaks = quants, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# and checking the crime variable distribution
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# then removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# and adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
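The quantile-based cut() step above can be illustrated on toy data (a hedged sketch; the variable x and its values are invented for illustration). With the default quantile() breaks and include.lowest = TRUE, the classes come out roughly equal in size:

```r
set.seed(1)
x <- rnorm(200)                       # hypothetical continuous variable
q <- quantile(x)                      # min, 25%, 50%, 75%, max
cats <- cut(x, breaks = q, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)                           # four classes of 50 observations each
```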
Next, we are determining the train and test sets for linear discriminant analysis, so that 80% of observations are in the train set and 20% in the test set.
# dividing the data into train and test sets
set.seed(14)
# 80% allocated to the train set
train <- sample(nrow(boston_scaled), round(0.8 * nrow(boston_scaled)))
# and the rest, 20% to the test set
test <- setdiff(1:nrow(boston_scaled), train)
# creating tables with only the train and test sets
train <- boston_scaled[train,]
test <- boston_scaled[test,]
Next, we are fitting the linear discriminant model based on the train set, and using it to predict the crime category for the test set.
# conducting the linear discriminant analysis for the train set
# using crime as the outcome variable and all other variables as the predictor variables
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2395062 0.2567901 0.2567901 0.2469136
##
## Group means:
## zn indus chas nox rm age
## low 0.88018684 -0.9415937 -0.10997442 -0.8527319 0.42189046 -0.8760001
## med_low -0.07228021 -0.2786005 -0.04518867 -0.5769003 -0.16683471 -0.4154389
## med_high -0.37922279 0.1835589 0.18195173 0.4087296 0.07424198 0.4046517
## high -0.48724019 1.0149946 -0.03610305 1.0491662 -0.47816139 0.7977622
## dis rad tax ptratio black lstat
## low 0.7912327 -0.6917945 -0.7302380 -0.43374706 0.38169387 -0.75045462
## med_low 0.4108086 -0.5434660 -0.4506348 -0.05185552 0.31564889 -0.13900995
## med_high -0.4055865 -0.4142635 -0.3135958 -0.30679176 0.07832721 0.02678365
## high -0.8417971 1.6596029 1.5294129 0.80577843 -0.74492067 0.89683260
## medv
## low 0.49143415
## med_low -0.03712692
## med_high 0.14269557
## high -0.67834465
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.14094159 0.583571762 -1.07527346
## indus 0.04862218 -0.388509504 0.47793431
## chas -0.08648985 -0.008871757 0.10195962
## nox 0.26003892 -0.865688531 -1.15397700
## rm -0.09711108 -0.044237317 -0.12791697
## age 0.23864839 -0.360282856 -0.07065071
## dis -0.14600415 -0.308393676 0.57267244
## rad 3.41164504 0.713321674 -0.18377522
## tax -0.03355981 0.309963183 0.56766834
## ptratio 0.13083968 0.053636214 -0.26691521
## black -0.14812971 0.041315399 0.11953790
## lstat 0.14592880 -0.136928628 0.40634740
## medv 0.12599482 -0.366917135 -0.10819431
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9515 0.0350 0.0135
# defining the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0, x1 = myscale * heads[,choices[1]], y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col=color, pos=3)}
# defining target classes as numeric, 1 - low, 4 - high
classes <- as.numeric(train$crime)
# plotting the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
# first determining the correct crime categories of the test set
correct_cat <- test$crime
# and removing the variable from the table
test <- dplyr::select(test, -crime)
# predicting classes of the test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_cat, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 10 1 0
## med_low 3 15 4 0
## med_high 0 8 13 1
## high 0 0 1 26
After fitting the model with the train data and applying it to the test data, we can see that 19+15+13+26=73 observations have been correctly classified, resulting in an accuracy of 73/101≈0.72. Most of the wrong predictions are also only one category “away” from the correct one, i.e., one category lower or higher. Only one prediction is two categories away from the correct one: the observation whose correct category is low but whose prediction is med_high. Therefore, the model seems to work relatively well in predicting the correct crime category, and when the prediction fails, it is rarely far from the accurate one. However, with the present, relatively small dataset, the choice of observations for the train and test sets can have a high impact on the model, as can whether the category proportions in both sets represent the actual data well.
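The accuracy figure above can also be computed directly from the cross-tabulation, since the diagonal holds the correct predictions. A minimal sketch, with the confusion matrix entered by hand from the table above:

```r
conf <- matrix(c(19, 10,  1,  0,
                  3, 15,  4,  0,
                  0,  8, 13,  1,
                  0,  0,  1, 26),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# accuracy = correctly predicted / all predictions
sum(diag(conf)) / sum(conf)
## [1] 0.7227723
```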
Next, the Boston housing dataset will be reloaded and standardized, and distances between the observations calculated.
# loading in the Boston data from the MASS package
data("Boston")
# centering and standardizing variables
boston_scaled <- as.data.frame(scale(Boston))
# creating a distance matrix using the Euclidean distance
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# as well as creating a distance matrix using the Manhattan distance
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
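The difference between the two metrics can be sketched on a toy pair of vectors (the values are made up for illustration): Euclidean distance is the square root of the sum of squared differences, while Manhattan distance is the sum of absolute differences:

```r
x <- c(1, 2, 3)
y <- c(4, 0, 3)

sqrt(sum((x - y)^2))   # Euclidean: sqrt(3^2 + 2^2 + 0^2) = sqrt(13) ~ 3.61
sum(abs(x - y))        # Manhattan: 3 + 2 + 0 = 5

# dist() gives the same values
m <- rbind(x, y)
as.numeric(dist(m))                        # Euclidean by default
as.numeric(dist(m, method = "manhattan"))
```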
Next, to further study the similarity of the observations, k-means clustering will be conducted. The unsupervised method will assign the observations into clusters based on their similarity.
set.seed(14)
# conducting the k-means clustering, starting with 4 clusters
km <- kmeans(boston_scaled, centers = 4)
# visualizing the dataset with clusters indicated by colours
ggpairs(boston_scaled[c("crim", "zn", "indus", "nox", "age", "dis", "rad", "tax", "lstat", "medv")], mapping = aes(col=factor(km$cluster), alpha=0.3), lower = list(combo = wrap("dot", alpha=0.3, size=0.1)), upper=list(continuous=wrap("cor", size=2.1))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))
It seems that 4 clusters may be too many, since the per-cluster distributions of the variables overlap quite notably in many of the plots. Therefore, testing divisions into 2 and 3 clusters next.
set.seed(14)
# conducting the k-means clustering with 2 clusters
km <- kmeans(boston_scaled, centers = 2)
# visualizing the dataset with clusters indicated by colours
ggpairs(boston_scaled[c("crim", "zn", "indus", "nox", "age", "dis", "rad", "tax", "lstat", "medv")], mapping = aes(col=factor(km$cluster), alpha=0.3), lower = list(combo = wrap("dot", alpha=0.3, size=0.1)), upper=list(continuous=wrap("cor", size=2.1))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))
set.seed(14)
# conducting the k-means clustering with 3 clusters
km <- kmeans(boston_scaled, centers = 3)
# visualizing the dataset with clusters indicated by colours
ggpairs(boston_scaled[c("crim", "zn", "indus", "nox", "age", "dis", "rad", "tax", "lstat", "medv")], mapping = aes(col=factor(km$cluster), alpha=0.3), lower = list(combo = wrap("dot", alpha=0.3, size=0.1)), upper=list(continuous=wrap("cor", size=2.1))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))
Division into 2 clusters seemed to work well for some variables, like indus and tax, as there seem to be two quite distinct peaks. However, for some variables, like lstat and medv, the per-cluster distributions are quite wide, which may indicate that 2 clusters do not optimally capture the complexity of the data. When dividing the data into 3 clusters, in turn, the clusters seem quite nicely distinguishable from each other for many variables, like nox, and, e.g., the distributions of medv and lstat are no longer as wide. Therefore, 3 clusters may be the best choice of the tested options.
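The visual comparison above can be complemented with a common numerical heuristic (a sketch, not part of the original analysis): the total within-cluster sum of squares (WCSS) decreases as the number of clusters k grows, and the "elbow" where the drop levels off suggests a reasonable k:

```r
library(MASS)  # for the Boston dataset
boston_scaled <- as.data.frame(scale(Boston))

set.seed(14)
# total WCSS for k = 1..6; nstart restarts reduce the risk of poor local optima
wcss <- sapply(1:6, function(k) kmeans(boston_scaled, centers = k, nstart = 10)$tot.withinss)
round(wcss)
```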
Looking at the last plot with 3 clusters more closely, we can see that the blue cluster seems quite distinct in many respects: it clearly has the highest indus (proportion of non-retail business acres per town), nox (nitrogen oxides concentration), rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000) values on average, while its dis (weighted mean of distances to five Boston employment centres) and medv (median value of owner-occupied homes in $1000s) values seem to be the lowest. All of these observations suggest that these areas cluster together because they are more industrial: such suburbs are more likely to have many industrial buildings and higher pollution, and they seem to lie closest to the employment centres, since industrial areas likely hold many jobs and highway access is likely important to the industries too. The high tax rate might also reflect industrial use of the buildings, as bigger institutions likely pay more tax for bigger properties than the average homeowner. Higher crime rates also seem to accumulate in this cluster, likely indicating that the residential parts of these suburbs are inhabited by people with a lower education level on average (higher lstat) and therefore likely lower income.
Regarding the other two clusters, they seem to represent less industrial areas. As the green cluster resembles the blue one a bit more than the red one does, I would interpret the green cluster as likely representing areas closer to downtown Boston, while the red cluster represents more distant suburbs. These clusters do not seem to differ much in terms of crime rate or tax. However, the red cluster seems to have the highest proportion of large residential lots (zn, proportion of residential land zoned for lots over 25,000 sq.ft.), the lowest proportion of non-retail business acres per town (indus) on average, the lowest nox (nitrogen oxides concentration) and the highest dis (weighted mean of distances to five Boston employment centres), all suggesting that these areas are further from the heart of the city. Also, the percentage of lower-status population (lstat) and the age of the homes seem lowest in these areas, while the median value of homes seems highest on average, indicating that these areas perhaps house the upper class, since the properties are on average newer and bigger and owned by more highly educated people. The green cluster, in turn, seems to be the 'middle one', so it likely represents the non-industrial suburbs closer to downtown Boston, with, e.g., pollution higher than in the red cluster yet lower than in the blue one, but on average a smaller distance to the employment centres (dis) than the red cluster.
All in all, these 3 clusters seem to represent 3 distinct classes of Boston suburbs well, and comparing the distributions also allows one to hypothesize about the types of areas these clusters likely represent on average.
date()
## [1] "Tue Nov 28 13:57:09 2023"
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(corrplot)
library(GGally)
library(tibble)
library(FactoMineR)
library(factoextra)
# reading in the human development index dataset, which has been processed from the data obtained originally from United Nations Human Development Reports
human <- read_csv("data/human.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# the first column of the table includes the country, so defining them to be the row names
human <- column_to_rownames(human, "Country")
# checking the structure
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : num 64992 42261 56431 44025 45435 ...
## $ Mat.Mor : num 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
There is information for 155 countries in the present dataset, and the columns include 8 variables related to the human development index: the female vs male ratios of the population with at least secondary education and of labour force participation (Edu2.FM and Labo.FM, respectively), life expectancy (Life.Exp), expected years of schooling (Edu.Exp), gross national income per capita (GNI), maternal mortality ratio (Mat.Mor), adolescent birth rate (Ado.Birth) and the percentage of female representatives in parliament (Parli.F).
# printing out the data summary
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# printing a graphical overview of the data
ggpairs(human, lower = list(continuous = wrap("points", alpha = 0.3, size=0.1), combo = wrap("dot", alpha = 0.3, size=0.1)), upper=list(continuous=wrap("cor", size=3))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))
Based on the summary and graphical overview of the data, we can observe that the distributions of the variables differ notably. The only variable that seems to roughly follow a normal distribution is the expected years of schooling (Edu.Exp): for the majority of countries the value is close to the mean, and extremes are expectedly rare. The distributions of the female vs male ratios of secondary education (Edu2.FM) and labour force participation (Labo.FM), as well as life expectancy (Life.Exp), seem to be skewed to the left, meaning that the majority of observations have medium/high values. Given the general state of human development, and, e.g., generally improved healthcare, it seems intuitive that life expectancy is high overall, with some exceptions in the least developed countries. Furthermore, it seems rational that higher female vs male ratios in education and labour force participation are prevalent, as more and more countries are trying to improve gender equality. However, the peak of the education ratio is more distinct, which is unsurprising, since the female labour force rate is expected to be somewhat lower in general, as pregnancies and maternity leaves also affect women with higher education. The distributions of gross national income per capita (GNI), maternal mortality ratio (Mat.Mor), adolescent birth rate (Ado.Birth) and the percentage of female representatives in parliament (Parli.F), in turn, all seem to be skewed to the right to some degree. Given historical facts like colonization, it is not surprising that the peak of GNI is closer to the lower end of values, while a smaller number of highly developed countries forms the tail to the right, with some extremes likely due to oil deposits.
Luckily, maternal mortality rates and adolescent birth rates also have more distinct peaks at the lower end of values, likely due to improved healthcare efforts and improved education of girls. The percentage of female representatives in parliament has its highest peak around 15%, which is not ideal yet somewhat expected, since female higher education and labour force participation are quite recent improvements in many countries, and the parliamentary ratio will likely increase with some delay. However, the ratio is usually not 50-50 in the more developed countries either.
# also calculating the correlation matrix
cor_matrix <- cor(human)
# and visualizing it
corrplot(cor_matrix, method="circle")
The correlations in the graphical overview and the plot above seem quite intuitive: life expectancy correlates positively with expected years of education and negatively with maternal mortality rate, the latter being the strongest correlation in the dataset overall. GNI, in turn, correlates positively with life expectancy and years of education, and negatively with maternal mortality and adolescent birth rate. Moreover, expected years of education and the female vs male ratio of secondary education correlate positively, as would be expected. Interestingly, the female vs male labour force participation ratio correlates only minimally with the other factors, positively with the percentage of female representatives in parliament, yet surprisingly also with maternal mortality rates.
Since the current dataset includes many variables, it might be difficult to interpret as a whole; dimensionality reduction techniques can increase interpretability while losing minimal information. To this end, principal component analysis (PCA) will be utilized next.
# performing principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# drawing a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex=0.6)
Based on the PCA results visualized above, it is still quite difficult to interpret the dataset as a whole: PC1 seems to be dominated by GNI alone, visualized as the long horizontal arrow (high standard deviation), while the impacts of the other variables are indistinguishable. Qatar is a very extreme outlier in GNI, which is the only variable this PCA properly accounts for and which seems to represent the PC1 component almost entirely. Therefore, scaling the data prior to PCA should make the analysis more insightful.
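The dominance of a single high-variance variable can be demonstrated on toy data (a hedged sketch; the 'income' and 'rate' columns are invented for illustration and are not part of the human dataset):

```r
set.seed(1)
toy <- data.frame(income = rnorm(100, mean = 20000, sd = 15000),  # large variance
                  rate   = rnorm(100, mean = 0.5,   sd = 0.1))    # small variance

p_raw    <- prcomp(toy)                 # unscaled PCA
p_scaled <- prcomp(toy, scale. = TRUE)  # equivalent to scaling the data first

abs(p_raw$rotation["income", "PC1"])    # ~1: PC1 is essentially just 'income'
p_scaled$sdev                           # after scaling, both components contribute comparably
```

Note that prcomp(x, scale. = TRUE) is equivalent to calling scale(x) before prcomp(x).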
# scaling the data
human_scaled <- scale(human)
# performing principal component analysis on the scaled data (with the SVD method)
pca_human_scaled <- prcomp(human_scaled)
# drawing a biplot of the principal component representation and the original variables
biplot(pca_human_scaled, choices = 1:2, cex=0.6)
The PCA on the scaled data seems to capture the overall variability a lot better than the analysis on the unscaled data: the impact of Qatar's extreme GNI is now minimized, the other variables also contribute to the principal components, and the contributions correspond to the previously observed correlations. The results are therefore much more insightful, and all the variables now seem to affect the first two principal components to a quite similar degree, as the arrows are of fairly similar length. The female vs male labour force participation ratio (Labo.FM) and the percentage of female representatives in parliament (Parli.F) seem to have the highest impact on PC2, while the rest contribute more to PC1. Countries with a higher female labour force participation ratio and more female representatives in parliament locate higher up in the plot, i.e., get higher values on PC2; such countries include, e.g., Rwanda and Mozambique but also Iceland and Norway, which then differ in their location on the x-axis, i.e., PC1. Countries at the lower end of PC2 include Iran and Yemen, which likely represent the countries with the most restricted women's rights and independence, and therefore minimal female participation in parliament and the labour force.
Maternal mortality ratio (Mat.Mor) and adolescent birth rate (Ado.Birth), in turn, seem to have a high impact on PC1: the higher the rate, the higher the value on PC1. In contrast, the female vs male ratio of secondary education (Edu2.FM), life expectancy (Life.Exp), expected years of schooling (Edu.Exp), and gross national income per capita (GNI) seem to have the opposite effect on PC1. Countries at the lower end of PC1 include the Nordic countries and, e.g., Australia and Switzerland, which is quite expected, since these developed countries have the highest education rates, highest incomes, and most comprehensive healthcare systems on average. Qatar, with clearly the highest GNI, is also distinguishable at the lower end of PC1, but it locates at the lower end of PC2 as well, i.e., likely has a low fraction of women in parliament and the labour force. Countries at the higher end of PC1 include, e.g., Mozambique, Sierra Leone and Niger, which is unsurprising, since these represent less developed countries in terms of education and income, with healthcare likely lacking in many respects, resulting in higher maternal mortality.
In general, the countries more developed in terms of education, income and gender equality seem to locate on average at the higher end of PC2 and the lower end of PC1, i.e., the upper left corner of the biplot, which is also observable from the variables contributing to each principal component. In addition, the less developed countries seem to vary considerably in female participation in the labour force and parliament: e.g., Rwanda has more women in parliament and the labour force, yet its maternal mortality and adolescent birth rates are relatively high and its GNI relatively low. In contrast, Iran has a minimal percentage of women in parliament and a low female vs male labour force participation ratio, while the country does better in terms of maternal mortality. Moreover, Qatar has clearly the highest GNI, lower maternal mortality and adolescent birth rates, and on average more women than men with higher education, yet no women in parliament.
To explore dimension reduction in the context of qualitative data, we will next analyze a dataset related to tea consumption with Multiple Correspondence Analysis (MCA).
# reading in the dataset, defining the character variables as factors
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea_time.csv", stringsAsFactors = TRUE)
# exploring the dimensions and structure
str(tea)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
View(tea)
It seems that the current dataset contains information about the preferred tea (Earl Grey/black/green), how the tea is consumed (alone/lemon/milk/other, and tea bag/unpackaged/tea bag+unpackaged), whether one uses sugar (No.sugar/sugar), where one purchases the tea (chain store/chain store+tea shop/tea shop), and whether one has tea with lunch (lunch/Not.lunch). There are in total 300 observations, so 300 people have reported their preferences, and the 6 variables listed above.
# visualizing the data
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust=1))
Visualizing the data reveals that the majority of people prefer Earl Grey tea, brewed from a tea bag, consumed without additions ("alone"), and purchased from a chain store. The vast majority reported that they do not drink tea over lunch, while the ratio of sugar vs non-sugar users was almost equal, with only a few more people reporting not using sugar.
# then conducting the multiple correspondence analysis
mca <- MCA(tea, graph = FALSE)
# plotting the MCA biplot, as well as variable plots
fviz_mca_biplot(mca, repel = FALSE, ggtheme = theme_minimal(), col.ind = c("powderblue"), col.var=c("tomato3"))
fviz_mca_var(mca, repel=TRUE, ggtheme=theme_minimal(), col.var = c("tomato3"))
fviz_mca_var(mca, repel = FALSE, ggtheme = theme_minimal(), choice="mca.cor", col.var = c("tomato3"))
In the biplot above, we can see how the individuals locate in the two-dimensional space and with which variable categories they seem to associate most; these categories are also plotted alone in the second plot. In particular, the individuals clustering towards the lower right corner (higher values of Dim1 and lower values of Dim2) seem to associate with consuming mostly unpackaged tea purchased from a tea shop. The unpackaged and tea shop categories are the furthest from the origin (0,0), so they seem to be the most discriminating variable categories. In general, with the limited variables and categories in the present data, the rarer answers seem to be the most discriminating, like unpackaged and tea shop, both reported by fewer than 40 people. In addition, only 9 people chose the "Other" category for the "How" question (other options: alone/milk/lemon), yet the category still locates relatively far from the origin. Moreover, the individuals in the upper right corner seem to associate with consuming both tea bags and unpackaged tea purchased from both chain stores and tea shops, while individuals consuming mostly tea bags from chain stores locate towards the lower left corner. Combined with the variables plot, we can deduce that the answers to the where (chain store/chain store+tea shop/tea shop) and how (tea bag/tea bag+unpackaged/unpackaged) questions seem to be the most discriminating, while answers to the other questions have quite minimal impact on the dimensions.